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Abstract 

We present statistical methods for big data arising from online analytical process¬ 
ing, where large amounts of data arrive in streams and require fast analysis without 
storage/access to the historical data. In particular, we develop iterative estimat¬ 
ing algorithms and statistical inferences for linear models and estimating equations 
that update as new data arrive. These algorithms are computationally efficient, 
minimally storage-intensive, and allow for possible rank deficiencies in the subset 
design matrices due to rare-event covariates. Within the linear model setting, the 
proposed online-updating framework leads to predictive residual tests that can be 
used to assess the goodness-of-fit of the hypothesized model. We also propose a new 
online-updating estimator under the estimating equation setting. Theoretical prop¬ 
erties of the goodness-of-fit tests and proposed estimators are examined in detail. In 
simulation studies and real data applications, our estimator compares favorably with 
competing approaches under the estimating equation setting. 
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1 Introduction 


The advancement and prevalence of compnter technology in nearly every realm of science 
and daily life has enabled the collection of “big data”. While access to snch wealth of 
information opens the door towards new discoveries, it also poses challenges to the cnrrent 
statistical and compntational theory and methodology, as well as challenges for data storage 
and compntational efficiency. 

Recent methodological developments in statistics that address the big data challenges 


have largely focnsed on snbsampling-based (e.g., Kleiner et ah, 2014 Liang et ah, 2013 Ma 


et ah, 2013) and divide and conqner (e.g., Lin and Xi, 2011 Gnha et ah, 2012| Chen and 


Xie, 2014) techniqnes; see Wang et ah (2015[) for a review. “Divide and conqner” (or “divide 


and recombine” or ‘split and conqner”, etc.), in particnlar, has become a popnlar approach 
for the analysis of large complex data. The approach is appealing becanse the data are 
first divided into snbsets and then nnmeric and visnalization methods are applied to each 
of the snbsets separately. The divide and conqner approach cnlminates by aggregating the 
resnlts from each snbset to prodnce a final solntion. To date, most of the focns in the 
final aggregation step is in estimating the nnknown qnantity of interest, with little to no 
attention devoted to standard error estimation and inference. 

In some applications, data arrives in streams or in large chnnks, and an online, seqnen- 
tially npdated analysis is desirable withont storage reqnirements. As far as we are aware, 
we are the first to examine inference in the online-npdating setting. Even with big data, 
inference remains an important issne for statisticians, particnlarly in the presence of rare- 
event covariates. In this work, we provide standard error formnlae for divide-and-conqner 
estimators in the linear model (LM) and estimating eqnation (EE) framework. We far¬ 
ther develop iterative estimating algorithms and statistical inferences for the LM and EE 
frameworks for online-npdating, which npdate as new data arrive. These algorithms are 
compntationally efficient, minimally storage-intensive, and allow for possible rank deficien¬ 
cies in the snbset design matrices dne to rare-event covariates. Within the online-npdating 
setting for linear models, we propose tests for ontlier detection based on predictive residn- 
als and derive the exact distribntion and the asymptotic distribntion of the test statistics 
for the normal and non-normal cases, respectively. In addition, within the online-npdating 


2 






























































setting for estimating eqnations, we propose a new estimator and show that it is asymptot¬ 
ically consistent. We fnrther establish new nniqneness resnlts for the resnlting cnmulative 
EE estimators in the presence of rank-dehcient snbset design matrices. Our simulation 
study and real data analysis demonstrate that the proposed estimator outperforms other 
divide-and-conquer or online-updated estimators in terms of bias and mean squared error. 

The manuscript is organized as follows. In Section we hrst briefly review the divide- 
and-conquer approach for linear regression models and introduce formulae to compute the 
mean square error. We then present the linear model online-updating algorithm, address 
possible rank dehciencies within subsets, and propose predictive residual diagnostic tests. In 
Section]^ we review the divide-and-conquer approach of Lin and Xi (2011) for estimating 
equations and introduce corresponding variance formulae for the estimators. We then 
build upon this divide-and-conquer strategy to derive our online-updating algorithm and 
new online-updated estimator. We further provide theoretical results for the new online- 
updated estimator and address possible rank dehciencies within subsets. Section [^contains 
our numerical simulation results for both the LM and EE settings, while Section [^contains 
results from the analysis of real data regarding airline on-time statistics. We conclude with 
a brief discussion. 


2 Normal Linear Regression Model 

2.1 Notation and Preliminaries 

Suppose there are N independent observations {(?/*, x*), i = 1, 2,..., A^} of interest and we 
wish to ht a normal linear regression model yi = x'/3-|-ej, where ~ X(0, independently 
for i = 1, 2,..., iV, and /3 is a p-dimensional vector of regression coefficients corresponding 
to covariates Xj {p x 1). Write y = (|/i, 2 / 2 , ■ ■ ■ iUn)' and X = (xi,X 2 ,... ,xjv)' where we 
assume the design matrix X is of full rank p < N. The least squares (LS) estimate of 
f3 and the corresponding residual mean square, or mean squared error (MSE), are given 
by P = (X'X)"^X'y and MSE = ]^y'(lAf “ H)y, respectively, where Iat is the N x N 
identity matrix and H = X(X'X)“^X'. 

In the online-updating setting, we suppose that the N observations are not available all 
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at once, but rather arrive in chunks from a large data stream. Suppose at each accumulation 
point k we observe and X^, the rifc-dimensional vector of responses and the nk x p 
matrix of covariates, respectively, for k = such that y = (y'^, y 2 ,..., y^)' and 

X = (X'^, X 2 ,..., X'^)'. Provided X^ is of full rank, the LS estimate of /3 based on the k^^ 
subset is given by 

= (Xl.XO-‘X;y» (1) 

and the MSE is given by 




nk-p 


y'kC^n, - Hfc)yfc, 


( 2 ) 


where = Xfc(X;Xfc)-^X;, for fc = 1, 2,..., X 
As in the divide-and-conquer approach (e.g.. 


Lin and Xi 


2011), we can write (3 as 


K -1 ^ 


'nk,k' 


(3) 


k=l k=l 

We provide a similar divide-and-conquer expression for the residual sum of squares, or sum 
of squared errors (SSE), given by 

K K . K , K 




(4) 

k=l k=l k=l k=l 

and MSE = SSE/(iV — p). The SSE, written as in (|^, is quite useful if one is interested 
in performing inference in the divide-and-conquer setting, as var(/3) may be estimated 
by MSE(X'X)-i = MSe( Ylk=i^'k^k) ^ We will see in Section 2.2 


that both ^ in ([^ 

and SSE in Q may be expressed in sequential form that is more advantageous from the 
perspective of online-updating. 


2.2 Online Updating 

While equations (|^ and Q are quite amenable to parallel processing for each subset, the 
online-updating approach for data streams is inherently sequential in nature. Equations ([^ 
and Q can certainly be used for estimation and inference for regression coefficients resulting 
at some terminal point K from a data stream, provided quantities (X'^^X^,y^y^) are 
available for all accumulation points k = 1,... ,K. However, such data storage may not 
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always be possible or desirable. Furthermore, it may also be of interest to perform inference 
at a given accumulation step k, using the k subsets of data observed to that point. Thus, 
our objective is to formulate a computationally efficient and minimally storage-intensive 
procedure that will allow for online-updating of estimation and inference. 


2.2.1 Online Updating of LS Estimates 


While our ultimate estimation and inferential procedures are frequentist in nature, a 
Bayesian perspective provides some insight into how we may construct our online-updating 
estimators. Under a Bayesian framework, using the previous k — \ subsets of data to 
construct a prior distribution for the current data in subset /c, we immediate identify the 
appropriate online updating formulae for estimating the regression coefficients (3 and the 
error variance cr^ with each new incoming dataset {yk,^k)- The Bayesian paradigm and 
accompanying formulae are provided in the Supplementary Material. 

Let l3j^ and MSE^. denote the LS estimate of /3 and the corresponding MSE based on 
the cumulative data = {(y^, X^), i = 1, 2,..., k}. The online-updated estimator of /3 
based on cumulative data is given by 


A = (xiXj + Vn)-'(x;xtA . + Vt-A-i), 


( 8 ) 


where Pq = 0 , is dehned by Q or Q, = Yl£=i X^X^ for A; = 1, 2,..., and Vq = Op 
a p X p matrix of zeros. Although motivated through Bayesian arguments, ([^ may also 


IS 


be found in a (non-Bayesian) recursive linear model framework (e.g., Stengel, 2012, page 
313). 

The online-updated estimator of the SSE based on cumulative data Dk is given by 
SSEfc = SSE,_i + SSE„„, + p[_,Vk-iPk-i + - Pk^kPk (6) 


where SSE^^. ^ is the residual sum of squares from the dataset, with corresponding 
residual mean square MSE^^^^ =SSE„j,_fc/(nfc —p). The MSE based on the data Dk is then 
MSEfc = SSEfc/(A"fc — p) where Nk = (= + ^k-i) for A; = 1, 2,.... Note that for 

k = K, equations (|^ and ([^ are identical to those in (|^ and Q, respectively. 

Notice that, in addition to quantities only involving the current data (yfc,Xfc) (i.e., 
^n^^k^ SSE„^,fc, XfcXfc, and nk), we only used quantities SSE^.i, Vfc_i, from 
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the previous accumulation point to compute and MSE^. Based on these online-updated 
estimates, one can easily obtain online-updated t-tests for the regression parameter esti¬ 
mates. Online-up dated ANOVA tables require storage of two additional scalar quantities 
from the previous accumulation point; details are provided in the Supplementary Material. 


2.2.2 Rank Deficiencies in 


When dealing with subsets of data, either in the divide-and-conquer or the online-updating 
setting, it is quite possible (e.g., in the presence of rare event covariates) that some of the 
design matrix subsets will not be of full rank, even if the design matrix X for the entire 
dataset is of full rank. For a given subset fc, note that if the columns of X^ are not linearly 
independent, but lie in a space of dimension < p, the estimate 

= (x;xo-x;y», (7) 

where (X),Xfc)“ is a generalized inverse of (X)j,Xfc) for subset k, will not be unique. However, 
both /3 and MSE will be unique, which leads us to introduce the following proposition. 


Proposition 2.1 Suppose X is of full rank p < N. If the columns oflKk are not linearly 
independent, but lie in a space of dimension Qk < p for any k = 1,K, ^ in ^ and SSE 
Q using $nk,k invariant to the choice of generalized inverse (X'^X^)” of 

(X',X,). 


To see this, recall that a generalized inverse of a matrix B, denoted by B^, is a matrix 
such that BB^B = B. Note that for (X).Xfc)“, a generalized inverse of (X).Xfc), k 
given in Q is a solution to the linear system (X'ffK.k)f3i. = X'^y^. It is well known that if 
(x',x,)- is a generalized inverse of (X'^X^.), then Xfc(X'^Xfc) X'^ is invariant to the choice 
of (X'^Xfc)” (e.g., Searle, 1971, p20). Both ([^ and (|^ rely on ^nk,k through product 


X.'^'XkPn^^k = XfcXfc(X),Xfc) X'^yfc = X).yfc which is invariant to the choice of (X'^X*,) 


Remark 2.2 The online-updating formulae ([^ and ([^ do not require X(,Xfc for all k to 
be invertible. In particular, the online-updating scheme only requires V^, = X^X^ to 
be invertible. This fact can be made more explicit by rewriting ([^ and (§. respectively, as 

h = (X',X, + V,_i)-i(X;y, + W,_i) = V-^(X',yfc + W,_i) 

SSEk = SSEk-i+y'kyk + Pk-i^k-ih-i-^^k^kh 
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( 8 ) 

(9) 





where Wq = 0 and W^. = Yl!l=i A; = 1, 2,.... 


Remark 2.3 Following Remark and using the Bayesian motivation discussed in the 
Supplementary Material, i/Xi is not of full rank (e.g., due to a rare event covariate), 
we may consider a regularized least sguares estimator by setting Vq ^ Op. For example, 
setting Vq = Alp, A > 0, with /Xq = 0 would correspond to a ridge estimator and could he 
used at the beginning of the online estimation process until enough data has accumulated; 
once enough data has accumulated, the biasing term Vq = Alp may be removed such that 
the remaining seguence of updated estimators and MSE^ are unbiased for jS and 
respectively. More specifically, set Vk = (note that the summation starts at 

i = 0 rather than i = 1) where XqXq = Vq, keep (Bq = 0, and suppose at accumulation 
point K we have accumulated enough data such that X^ is of full rank. For k < k and 
Vo = Alp, A > 0, we obtain a (biased) ridge estimator and corresponding sum of sguared 
errors by using (|^ and 0 or ([^ and 0, At k = K, we can remove the bias with, e.g.. 


= (X',X. + V._i-Vo)-'(X;y. + W._i) 


( 10 ) 

( 11 ) 


and then proceed with original updating procedure for k > k to obtain unbiased estimators 
of jS and 


2.3 Model Fit Diagnostics 


While the advantages of saving only lower-dimensional snmmaries are clear, a potential dis¬ 
advantage arises in terms of difficnlty performing classical residnal-based model diagnostics. 
Since we have not saved the individnal observations from the previous {k — 1) datasets, 
we can only compute residuals based upon the current observations (y^, X^). For example, 
one may compute the residuals eki = yn — Oki, where i = 1,... ,nk and ijki = ^ki^n^,k) 
even the externally studentized residuals given by 


tki 


^ki 




- hk,ii) LSSE„„fc(l - hk,n) - e 


Uk-p-l 


nl/2 


ki 


( 12 ) 


where hk^u = Diag(Hfc)j = Diag(Xfc(X'^Xfc)X).)j and MSEnkMi) 1^^® MSE computed from 
the k^^ subset with the observation removed, i = 1,... ,nk. 
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However, for model fit diagnostics in the online-update setting, it would arguably be 
more useful to consider the predictive residuals, based on from data -Dfc-i with pre¬ 
dicted values Yk = {Vki, • • •, VknJ = y^kPk-i^ Cki = i/ki - Vki, i = I,..., Uk- Dehne the 
standardized predictive residuals as 

iki = 4*/A/var(4i), i = l,...,nfc. (13) 


2.3.1 Distribution of standardized predictive residuals 

To derive the distribution of iki, we introduce new notation. Denote = (y(,..., 
and Xk-i and e^-i the corresponding Nk-i xp design matrix of stacked X^, i = 1,..., k — 1, 
and Nk-i x 1 random errors, respectively. For new observations yfc,Xfc, we assume 


Yk = ^k^ + Cfc, (14) 

where the elements of are independent with mean 0 and variance independently of 
the elements of Sk-i which also have mean 0 and variance Thus, E{eki) = 0, var(efcj) = 
0 - 2(1 -1- x'^i{X'^_^Xk-i)~^Xki) for i = 1,..., Uk, and 

var(efe) = + Xk{Xi_,X k-^y^X^k) 


where = (4i,... ,4nJ'- 

If we assume that both and Sk-i are normally distributed, then it is easy to show that 
e),var(efc)“^efc ~ xi^- Thus, estimating ^ with MSEfc_i and noting that ~ 

independently of e),var(efc)“^efc, we hnd that iki ~ 


Fk : = 




ni-MSEt-i 


nfe,7Vfe_i-p- 


(15) 


If we are not willing to assume normality of the errors, we introduce the following propo¬ 
sition. The proof of the proposition is given in the Supplementary Material. 


Proposition 2.4 Assume that 


1. Ci, i = l,...,nfc, are independent and identically distributed with E{ei) = 0 and 



2. the elements of the design matrix are uniformly bounded, i.e., \Xij\ < (7, V i,j, 
where C < oo is constant; 


3. lim 

Affe_i^oo 


= Q, where Q is a positive definite matrix. 


Let = r where TF' = I„, + Xk{X'^_iXk-i) ^X'^. Write e^' = (e^/,..., 
where e^. is an rik^ x 1 vector consisting of the + 1)^^ component through the 

{J2]=i'^ke)th component of and = '^k- further assume that 

Ti]^. 

4- lim —^ = Ci, where 0 < Cj < oo is constant for i = 1,... ,m. 

rik^oo nk 


Then at accumulation point k, we have 

V™ p* 'l2 

Wii=l ni,. Wkiki) 


MSEk-i 

where 1^. is an riki x 1 vector of all ones. 


d 2 


as Hk, Nk-l -)■ OO, 


(16) 


2.3.2 Tests for Outliers 


Under normality of the random errors, we may nse statistics tki in (13) and Fk in (15) to 
test individnally or globally if there are any ontliers in the dataset. Notice that iki in 


(13) and Fk in (15) can be re-expressed eqnivalently as 


iu = en/\/MSEi_,(l + iii(Vi_,)->a:n) 
ei(I„.+Xi(Vi_,)->Xi)->ei 


Ft = 


(17) 

(18) 


nfcMSEfc_i 

and thns can both be computed with the lower-dimensional stored summary statistics from 
the previous accumulation point. 

We may identify as outlying yki observations those cases whose standardized predicted 
iki are large in magnitude. If the regression model is appropriate, so that no case is 
outlying because of a change in the model, then each iki will follow the t distribution with 
Nk-i — p degrees of freedom. Let pki = P(|tArj._j_p| > \iki\) be the unadjusted p-value 


and let pki be the corresponding adjusted p-value for multiple testing (e.g., Benjamini and 


Hochberg, 1995 Benjamini and Yekutieli, 2001). We will declare yki an outlier if pki < a 


for a prespecified a level. Note that while the Benjamini-Hochberg procedure assumes the 
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multiple tests to be independent or positively correlated, the predictive residuals will be 
approximately independent as the sample size increases. Thus, we would expect the false 
discovery rate to be controlled with the Benjamini-Hochberg p-value adjustment for large 
Nk-i. 

To test if there is at least one outlying value based upon null hypothesis Hq : E{ek) = 0, 
we will use statistic Fk- Values of the test statistic larger than F(1 — a, Uk, N^-i —p) would 
indicate at least one outlying pki exists among i = 1,..., at the corresponding a level. 

If we are unwilling to assume normality of the random errors, we may still perform a 


global outlier test under the assumptions of Proposition 2.4 Using Proposition 2.4 and 


following the calibration proposed in Muirhead (1982) (Muirhead, 2009, page 218), we 
obtain an asymptotic F statistic 




2^1=1 rtfe. y^kiki 


* N 2 


-Vfe-i — m + 1 d , . , . 

—>■ F (m, — m + 1), as —)■ oo. (19) 


MSEfc_i Nk-i ■ m 

Values of the test statistic Fj^ larger than F(1 — a, m, Nk_i — m + 1) would indicate at least 
one outlying observation exists among at the corresponding a level. 

Remark 2.5 Recall that var{ek) = (Ins, + = Pr'cr^, where P is 

an Uk X Uk invertible matrix. For large Uk, it may he challenging to compute the Cholesky 
decomposition of var{ek). One possible solution that avoids the large nk issue is given in 
the Supplementary Material. 

3 Online Updating for Estimating Equations 

A nice property in the normal linear regression model setting is that regardless of whether 
one “divides and conquers” or performs online updating, the hnal solution will be the 
same as it would have been if one could £t all of the data simultaneously and obtained /3 
directly. However, with generalized linear models and estimating equations, this is typically 
not the case, as the score or estimating functions are often nonlinear in /3. Consequently, 
divide and conquer strategies in these settings often rely on some form of linear approx¬ 
imation to attempt to convert the estimating equation problem into a least square-type 


problem. For example, following Lin and Xi (2011), suppose N independent observations 
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{zj, i = 1,2,..., N}. For generalized linear models, Zj will be (?/*, Xj) pairs, i = 1,..., iV 
with Eiiji) = for some known function g. Suppose there exists /3o G such that 

/3o)] = 0 for some score or estimating function -0. Let denote the solution 
to the estimating equation (EE) 


N 


^(/5) = = 0 


Z=1 


and let Vat be its corresponding estimate of covariance, often of sandwich form. 

Let {zki,i = 1,... ,nfc} be the observations in the kth subset. The estimating function 
for subset k is 

rik 

MukAf^) = / 3 ). ( 20 ) 


2=1 


Denote the solution to = 0 as finkM' define 


^a^(zfci,/3 J 

= - 2 ^ - 
2=1 


d(3 


( 21 ) 


a Taylor Expansion of at is given by 


Mnj.,k{l3) — -^nk,k{l3 ^nk,k) T 

as M„,,fc(/3n,,fc) = 0 and Rns,,fc is the remainder term. As in the linear model case, we 
do not require to be invertible for each subset k, but do require that is 


invertible. Note that for the asymptotic theory in Section 3.3, we assume that A^^ ^ is 
invertible for large rik- For ease of notation, we will assume for now that each A^^^fc is 


invertible, and we will address rank dehcient An^^k in Section 3.4 below. 


The aggregated estimating equation (AEE) estimator of Lin and Xi (2011) combines 
the subset estimators through 


K 


K 


^NK ~ I ^ ^ Anf.,k I ^ ^ ^rik,k^nk.k 


( 22 ) 


.fc=i 


k=l 


which is the solution to Anfe,A;(/3 — Pnk,k) = 0- Li^ ^^^1 Xi (2011) did not discuss a 
variance formula, but a natural variance estimator is given by 


K 


-1 


K 


^NK — 


Ank,k j An;,,fc'^Tifc,fcAy^^ j. 


.fc=l 


k=l 


K 


An*,,A 


.A:=l 


(23) 
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where is the variance estimator of /3„^ from the subset k. If is of sandwich form, 
it can be expressed as where is an estimate of = var(M„j^,fc(/3))- 

Then, the variance estimator becomes 


K 


K 


^NK - I y^,-^nu,k I 

^ fc=l 


k=l 


K 


^nk,k 


.fc=l 


(24) 


which is still of sandwich form. 


3.1 Online Updating 

Now consider the online-updating perspective in which we would like to update the es¬ 
timates of /3 and its variance as new data arrives. For this purpose, we introduce the 
cumulative estimating equation (CEE) estimator for the regression coefficient vector at 
accumulation point k as 

^k = (Afc_l -I- Ani^^k) ^{Ak-l^k-1 + ^nk,k^nk,k)- (25) 

for A; = 1, 2,... where = 0, Aq = Op, and = M-i + 

For the variance estimator at the k^^ update, we take 


(Afc_i -|- A.nk,k) (Afc_iA^_J^ -|- A.n^.,k^nk,k-^nk,k)[i-^k:-l + 


\-llT 


(26) 


with Vq = Op and Aq = Op. 


By induction, it can be shown that (25) is equivalent to the AEE combination (22) 


when k = K, and likewise (26) is equivalent to (24) (i.e., AEE=CEE). However, the AEE 
estimators, and consequently the CEE estimators, are not identical to the EE estimators 
and Vjv based on all N observations. It should be noted, however, that 


Lin and Xi 


(2011) did prove asymptotic consistency of AEE estimator under certain regularity 


conditions. Since the CEE estimators are not identical to the EE estimators in hnite sample 
sizes, there is room for improvement. 

Towards this end, consider the Taylor expansion of —M„j, fc(/3) around some vector 
to be dehned later. Then 


■^nfc,A:(/3) — "^nj.,fc(^nfc,A:) + [Anfe,fc(^nfc,fc)](/3 ^nfc,fc) + 
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with Rnj,,fc denoting the remainder. Denote as the solution of 

K K 

Y .= 0. (27) 

k=\ k=\ 

Dehne = [^nk,k{,^nk,k)] and assume An^^k refers to Then we have 

r K 1 f ^ ^ 

E -^nk,k^nk,k T ( 28 ) 


. k=l 


. k=l 


k=l 


If we choose iSn^^k = f^n^,ky then f3j^ in (28) reduces to the AEE estimator of 


Lin and 


Xi 


(|2011|) in (1^, as reduces to Ylk=i “ (^n^,k) = 0 because Mn^^kiPn^^k) = 0 


for all fc = 1,..., A. However, one does not need to choose fink,k = online¬ 

updating setting, at each accumulation point k, we have access to the summaries from the 
previous accumulation point k — 1, so we may use this information to our advantage when 
dehning Pnk,k- Consider the intermediary estimator given by 

k-l 

^nk,k ~ (Afc-l -|- An^A ^ Ani/Pne,i ~k Anfc,fe^rn.,fc) 


l=l 


for A; = 1, 2,..., Ao = Op, = 0, and Afc = YAt=i ^ni,i- Estimator ( [^ combines the 
previous intermediary estimators £ = 1,..., /c — 1 and the current subset estimator 

0nk^k^ ^iid arises as the solution to the estimating equation 

k-l 

'y ^ ~ PruA T -^nfe,fc(/3 ~ Pnj^,k) ~ 


i=l 


where An^AP~Pnk,k) serves as a bias correction term due to the omission of — YAiJi ^n^APukA 
from the equation. 


With the choice of [3 ^ given in (29), we introduce the cumulatively updated esti¬ 


mating equation (CUEE) estimator as 


Pk ~ (-^fc-l + An^.A (^fc-1 + ^n^.,k0nk,k + bfc_i -|- Mnf.Af^nk,k)^ 


(30) 


with afe = Y! 1=1 An,,k$n^,k = An^,k$n,,k+^k-l and bfc = Mn.A^n^A = Mn^A^n^,k) + 

bfc_i where Uq = bg = 0, Ag = Op, and k = 1,2, ■ ■ ■ ■ Note that for a terminal k = K, (30) 


is equivalent to (28). 
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For the variance of f3i^, observe that 


0 ^rij^ ,k) “I" ,k{fi rij^ ^nk,k)- 

Thus, we have An^^k$nk,k + Mnk,k0nk,k) ~ Using the above approximation, the 

variance formula is given by 

k 

^k —{Ak-l + Anf,^k) ni/A^^^g)\{^Ak-l + A^^^k) ] 

£=1 

= {Ak-l + An^,,k) ^{Ak-l'Vk-lAk_i + Ank,k^nk,kA^^^k)[i^k-l + An^^k) (31) 

for /c = 1, 2,... and Aq = Vq = Op. 

Remark 3.1 Under the normal linear regression model, all of the estimating eguation 
estimators become “exact”, in the sense that /3^ = (X'X)~^X'y = (3^j^ = = /3^. 


3.2 Online Updating for Wald Tests 

Wald tests may be used to test individual coefficients or nested hypotheses based upon ei¬ 
ther the CEE or CUEE estimators from the cumulative data. Let 0k = 0k,i, ■ ■ ■, Pk,p)\^ k) 
refer to either the CEE regression coefficient estimator and corresponding variance in equa¬ 


tions (25) and (26), or the CUEE regression coefficient estimator and corresponding vari¬ 


ance in equations (30) and (31). 

To test Hq : = 0 at the update (j = 1,... ,p), we may take the Wald statis¬ 

tic = l3lj/va,T0k,j), or equivalently, zlj = Pk,j/se0k,j), where the standard error 


se{(3k,j) = Y var(/3fcj) and Yax[(3k,j) is the diagonal element of V^. The corresponding 
p-value is P{\Z\ > \zlj\) = P{xi P ^10 where Z and Xi standard normal and 1 
degree-of-freedom chi-squared random variables, respectively. 

The Wald test statistic may also be used for assessing the difference between a full model 
Ml relative to a nested submodel M2. If (3 is the parameter of model Ml and the nested 
submodel M2 is obtained from Ml by setting C/3 = 0, where C is a rank q contrast matrix 
and V is a consistent estimate of the covariance matrix of estimator the test statistic 
is ^ C'(CVC')“^C;9, which is distributed as Xq under the null hypothesis that C/3 = 0. 
As an example, if Ml represents the full model containing all p regression coefficients at 
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the update, where the hrst coefficient is an intercept, we may test the global null 
hypothesis Hq : (32 = ■ ■ ■ = (^p = 0 with wl = where C is (p — 1) x p 

matrix c = [o,Vi] and the corresponding p-value is P{xl-i ^ '>^1)- 


3.3 Asymptotic Results 


In this section, we show consistency of the CUBE estimator. Specihcally, Theorem 3.2 


shows that, under regularity, if the EE estimator based on the all N observations (3pf is a 
consistent estimator and the partition number K goes to inhnity, but not too fast, then the 
CUEE estimator /3^ is also a consistent estimator. We hrst provide the technical regularity 
conditions. We assume for simplicity of notation that Uk = n for all k = 1,2,..., K. Note 


that conditions (C1-C6) were given in Lin and Xi| (2011), and are provided below for 
completeness. The proof of the theorem can be found in the Supplementary Material. 


(Cl) The score function -0 is measurable for any Exed (3 and is twice continuously differ¬ 
entiable with respect to (3. 

(C2) The matrix jg semi-positive dehnite (s.p.d.), and — positive 

dehnite (p.d.) in a neighborhood of (3q when n is large enough. 

(C3) The EE estimator I3^ k is strongly consistent, i.e. (3n,k /^o almost surely (a.s.) as 

n —)■ cxD. 


(C4) There exists two p.d. matrices, Ai and A 2 such that Ai < n < A 2 for all 

k = 1,... ,K, i.e. for any v G v'Aiv < ?7,“^v'A„^fcV < v'Aiv, where A„_fc is given 


in (|^. 

(C5) In a neighborhood of (3^, the norm of the second-order derivatives ^ bounded 

dfj 

uniformly, i.e. || ^ ^ where C2 is a constant. 

d(3 

(C6) There exists a real number a G (1/4,1/2) such that for any p > 0, the EE estimator 
$^k satishes P(n"||/3„^fc — /3ol| > p) < Cnn^°‘~^, where C,, > 0 is a constant only 
depending on rj. 
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Rather than using condition (C4), we will use a slightly modihed version which focuses 
on the behavior of An,k{(3) for all (3 in the neighborhood of (3q (as in (C5)), rather than 
just at the subset estimate $nk- 

(C4’) In a neighborhood of /3o, there exists two p.d. matrices Ai and A 2 such that Ai < 
n“^A„ fc(/3) < A 2 for all f3 in the neighborhood of /3 q and for all k = 1,K. 


Theorem 3.2 Let f3^ be the EE estimator based on entire data. Then under (C1)-(C2), 
(C4’)-(C6), if the partition number K satisfies K = OijE) for some 0 < 7 < min{l — 
2a, 4a — 1}, we have P(a//V||^;^ — > ^) = o(l) for any 5 > 0. 


Remark 3.3 If ni- ^ n for all k, Theorem 3.2 will still hold, provided for each k, 


nk-l 

rT-k 


IS 


bounded, where nk_i and nk are the respective sample sizes for subsets k — 1 and k. 


Remark 3.4 Suppose N independent observations (j/i,Xj), i = 1,...,N, where y is a 
scalar response and ^ is a p-dimensional vector of predictor variables. Eurther suppose 
^iVi) = for i = I,... ,N for g a continuously differentiable function. Under mild 


regularity conditions, Lin and Xi (2011) show in their Theorem 5.1 that condition (C6) is 
satisfied for a simplified version of the quasi-likelihood estimator of f3 (Chen et al., 1990( , 
given as the solution to the estimating equation 


N 


W) = = 0 . 


i=l 


3.4 Rank Deficiencies in 

Suppose N independent observations {yi,:x.i), i = 1,... ,N, where y is a scalar response and 
X is a p-dimensional vector of predictor variables. Using the same notation from the linear 
model setting, let {yki,^ki), i = I, ■ ■ ■ ,nk, be the observations from the k^^ subset where 
Yk = iVki, yk2, • ■ ■, Vkuk)' and X*. = (x^i, Xfc2, • • •, For subsets k in which X*. is not 

of full rank, we may have difficulty in solving the subset EE to obtain which is used 


to compute both the AEE/CEE and CUEE estimators for /3 in (22) and (28), respectively. 
However, just as in the linear model case, we can show under certain conditions that if 


X = (Xj,X 2 ,... ,X'^)' has full column rank p, then the estimators (3^^ ia (22) and (3j^ 


in (28) for some terminal K will be unique. 
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Specifically, consider observations (y^jX^) such that E{yki) = Hki = gijlki) with r]ki = 
x'^-/3 for some known function g. The estimating function ip for the dataset is of the 
form 

'4^{,'^kii ^') IT/jj ijjki j-^ki) i ^ !-)•••) ^ki 

where Ski = d^ki/dgki, and Wki is a positive and possibly data dependent weight. Specif¬ 
ically, Wki may depend on (3 only through rjki- In matrix form, the estimating equation 
becomes 


= 0 , 


(32) 


where Sr = Diag(Sn,... ,Sr„J, Wr = Diag(hl^ri,... and = (fia, ■ ■. 

With Sr, Wr, and /if. evaluated at some initial value the standard Newton- 


Raphson method for the iterative solution of (32) solves the linear equations 


X'rSiWrSrXr(/3 - /3'“>) = X'rSiWr(yr - Mr) 


(33) 


for an updated (3. 


Rewrite equation (33) as 


xpspWkSkXkf3 = X;S',WfcVfc 


(34) 


where = y*. — SfcXfc/3*'°\ Equation (|34|) is the normal equation of a weighted least 


squares regression with response v^, design matrix S^Xa,, and weight W^. Therefore the 
iterative reweighted least squares approach (IRLS) can be used to implement the Newton- 


Raphson method for an iterative solution to (32) (e.g.. Green, 1984). 


Rank dehciency in X^ calls for a generalized inverse of X'^S'^WfcSfcXfc. In order to 


show uniqueness of estimators /3j.^j^ in (22) and /3^ in (28) for some terminal K, we must 
hrst establish that the IRLS algorithm will work and converge for subset k given the same 
initial value when X^ is not of full rank. Upon convergence of IRLS at subset k with 
solution $nf,,k: '^6 must then verify that the CEE and CUEE estimators that rely on 13^^ k 
are unique. The following proposition summarizes the result; the proof is provided in the 
Supplementary Material. 
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Proposition 3.5 Under the above formulation, assuming that conditions (C1-C3) hold 


for a full-rank sub-column matrix ofKk, estimators /3 j^k (22) and (3j^ in (28) for some 

terminal K will be unique provided X is of full rank. 


The simulations in Section 4.2 consider rank deficiencies in binary logistic regression 
and Poisson regression. Note that for these models, the variance of the estimators /3^ and 
are given by = (J2k=i or = (J2k=i ^nk,k)~^■ For robust sandwich 
estimators, for those subsets k in which A„^^fc is not invertible, we replace 


and nk,k-^nk,k “meat” of equations (26) and (|31[), respectively, with an es¬ 


timate of Qnfc.fc from (24). In particular, we use Qnfc,fc = Er=iV'(zfc*,/3fc)^(zfci,^fc)^ for 
the CEE variance and Qn^.A: = f3k)'f’{'^ki, f3k)~^ for the CUEE variance. We 


use these modifications in the robust Poisson regression simulations in Section 4.2.2 for 


the CEE and CUEE estimators, as by design, we include binary covariates with somewhat 
low success probabilities. Consequently, not all subsets k will observe both successes and 
failures, particularly for covariates with success probabilities of 0.1 or 0.01, and the corre¬ 
sponding design matrices will not always be of full rank. Thus A„j, ^ will not always be 
invertible for finite but will be invertible for large enough Uk- We also perform proof of 


concept simulations in Section |4.2.3| in binary logistic regression, where we compare CUEE 
estimators under different choices of generalized inverses. 


4 Simulations 


4.1 Normal Linear Regression: Residual Diagnostic Performance 


In this section we evaluate the performance of the outlier tests discussed in Section 2.3.2[ 
Let k* denote the index of the single subset of data containing any outliers. We generated 
the data according to the model 


Vki ^ki(^ ^ki T bkSrjki, i 1,..., nk, 


(35) 


where bk = 0 if k ^ k* and bk ~ Bernoulli(0.05) otherwise. Notice that the first two 
terms on the right-hand-side correspond to the usual linear model with (3 = (1, 2, 3,4, 5)', 
Xki[ 2 :b] ~ A^( 0 ,l 4 ) independently, Xki[i] = 1, and Cki are the independent errors, while the 
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final term is responsible for generating the outliers. Here, rj^i ~ Exp(l) independently 
and 6 is the scale parameter controlling magnitude or strength of the outliers. We set 
6 G {0,2,4,6} corresponding to “no”, “small”, “medium”, and “large” outliers. 


To evaluate the performance of the individual outlier test in (17), we generated the 
random errors as ~ N(0,1). To evaluate the performance of the global outlier tests in 


(18) and (19), we additionally considered eki as independent skew-t variates with degrees of 
freedom z/ = 3 and skewing parameter 7 = 1.5, standardized to have mean 0 and variance 
1. To be precise, we use the skew t density 


9{x) = 


7 +- 

' 7 


ifiix) for a; < 0 


for a: > 0 

7+7 — 


where f{x) is the density of the t distribution with u degrees of freedom. 

For all outlier simulations, we varied k*, the location along the data stream in which 
the outliers occur. We also varied Uk = rik* G {100,500} which additionally controls 
the number of outliers in dataset k*. For each subset i = 1,..., k* — 1 and for 95% of 
observations in subset k*, the data did not contain any other outliers. 


To evaluate the global outlier tests (18) and (19) with m = 2, we estimated power 
using B = 500 simulated data sets with signihcance level a = 0.05, where power was 
estimated as the proportion of 500 datasets in which Fk* > F(0.95,n^*, W*-i — 5) or 
Fk* > 7^(0.95, 2, — !)• The power estimates for the various subset sample sizes rifc., 

locations of outliers /c*, and outlier strengths 5 appear in Table [Tj When the errors were 
normally distributed (top portion of table), notice that the Type I error rate was controlled 
in all scenarios for both the F test and asymptotic F test. As expected, power tends to 
increase as outlier strength and/or the number of outliers increase. Furthermore, larger 
values of fc*, and hence greater proportions of “good” outlier-free data, also tend to have 
higher power; however, the magnitude of improvement decreases once the denominator 
degrees of freedom (W*-i — p or W*-i — m -|- 1) become large enough, and the F tests 


essentially reduce to tests. Also as expected, the F test given by (18) is more powerful 


than the asymptotic F test given in (19) when, in fact, the errors were normally distributed. 
When the errors were not normally distributed (bottom portion of table), the empirical 


type I error rates of the F test given by (18) are severely inflated and hence, its empirical 
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Table 1: Power of the outlier tests for various locations of outliers {k*), subset sample sizes 
(ufc = Uk*), and outlier strengths (no, small, medium, large). Within each cell, the top 
entry corresponds to the normal-based F test and the bottom entry corresponds to the 
asymptotic F test that does not rely on normality of the errors. 


Outlier 


» = 100 (5 true outliers) 


= 500 (25 true outliers) 

Strength 

k* = b 

k* = 10 

k* = 25 

k* = 100 

k* = 5 

k* = 10 

k* = 25 k* 

= 100 


F Test/Asymptotic F Test(m=2) 

F Test/Asymptotic F Test(m=2) 

Standard Normal Errors 







none 

0.0626 

0.0596 

0.0524 

0.0438 

0.0580 

0.0442 

0.0508 

0.0538 


0.0526 

0.0526 

0.0492 

0.0528 

0.0490 

0.0450 

0.0488 

0.0552 

small 

0.5500 

0.5690 

0.5798 

0.5718 

0.9510 

0.9630 

0.9726 

0.9710 


0.2162 

0.2404 

0.2650 

0.2578 

0.6904 

0.7484 

0.7756 

0.7726 

medium 

0.9000 

0.8982 

0.9094 

0.9152 

1.0000 

1.0000 

1.0000 

1.0000 


0.5812 

0.6048 

0.6152 

0.6304 

0.9904 

0.9952 

0.9930 

0.9964 

large 

0.9680 

0.9746 

0.9764 

0.9726 

1.0000 

1.0000 

1.0000 

1.0000 


0.5812 

0.6048 

0.6152 

0.6304 

0.9998 

1.0000 

1.0000 

1.0000 

Standardized Skew t Errors 







none 

0.2400 

0.2040 

0.1922 

0.1656 

0.2830 

0.2552 

0.2454 

0.2058 


0.0702 

0.0630 

0.0566 

0.0580 

0.0644 

0.0580 

0.0556 

0.0500 

small 

0.5252 

0.4996 

0.4766 

0.4520 

0.7678 

0.7598 

0.7664 

0.7598 


0.2418 

0.2552 

0.2416 

0.2520 

0.6962 

0.7400 

0.7720 

0.7716 

medium 

0.8302 

0.8280 

0.8232 

0.8232 

0.9816 

0.9866 

0.9928 

0.9932 


0.5746 

0.5922 

0.6102 

0.6134 

0.9860 

0.9946 

0.9966 

0.9960 

large 

0.9296 

0.9362 

0.9362 

0.9376 

0.9972 

0.9970 

0.9978 

0.9990 


0.7838 

0.8176 

0.8316 

0.8222 

0.9988 

0.9992 

0.9998 

1.0000 


Power with “outlier strength = no” are Type I errors. 


power in the presence of outliers cannot be trusted. The asymptotic F test, however, 
maintains the appropriate size. 


For the outlier f-test in (17), we examined the average number of false negatives (FN) 
and average number of false positives (FP) across the B = 500 simulations. False negatives 
and false positives were declared based on a Benjamin!-Hochberg adjusted p-value threshold 
of 0.10. These values were plotted in solid lines against outlier strength in Figure [T] for 
Uk* = 100 and Uk* = 500 for various values of k* and 6. Within each plot the FN decreases 
as outlier strength increases, and also tends to decrease slightly across the plots as k* 
increases. FP increases slightly as outlier strength increases, but decreases as k* increases. 
As with the outlier F test, once the degrees of freedom W*-i — P get large enough, the t- 
test behaves more like a ^-test based on the standard normal distribution. For comparison, 
we also considered FN and FP for an outlier test based upon the externally studentized 
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Outliers in Subset k*=2 Outliers in Subset k*=25 Outliers in Subset k*=100 




Figure 1: Average numbers of False Positives and False Negatives for outlier t-tests for 
Ufc* = 100 (top) and Uk* = 500 (bottom). Solid lines correspond to the predictive residual 
test while dotted lines correspond to the externally studentized residuals test using only 
data from subset k*. 


residuals from subset k* only. Specihcally, under model (14), the externally studentized 


residuals tk*i as given by (13) follow a t distribution with Uk* — p — 1 degrees of freedom. 
Again, false negatives and false positives were declared based on a Benjamini-Hochberg 
adjusted p-value threshold of 0.10, and the FN and FP for the externally studentized 
residual test are plotted in dashed lines in Figure [T] for Uk* = 100 and Uk* = 500. This 
externally studentized residual test tends to have a lower FP, but higher FN than the 
predictive residual test that uses the previous data. Also, the FN and FP for the externally 
studentized residual test are essentially constant across k* for hxed Uk*, as the externally 


21 




































0 . 0150 - 


0 . 0125 - 



Number of Blocks (K) 

Figure 2: RMSE comparison between CEE and CUEE estimators for different numbers of 
blocks. 

studentized residual test relies on only the current dataset of size and not the amount 
of previous data controlled by k*. Consequently, the predictive residual test has improved 
power over the externally studentized residual test, while still maintaining a low number 
of FP. Note that the average false discovery rate for the predictive residual test based on 
Benjamini-Hochberg adjusted p-values was controlled in all cases except when k* = 2 and 
Uk* = 100, representing the smallest sample size considered. 

4.2 Simulations for Estimating Equations 

4.2.1 Logistic Regression 

To examine the effect of the total number of blocks K on the performance of the CEE and 
CUEE estimators, we generated yi ~ Bernoulh(/ri), independently for z = 1,..., 100000, 
with logit(/ij) = x'/3 where /3 = (1,1,1,1,1,1)', Xi[ 2 -A] ~ Bernoulli(0.5) independently, 
3^i[5:6] ~ l\^(0,l2) independently, and Xki[i] = 1- The total sample size was hxed at iV = 
100000, but in computing the CEE and CUEE estimates, the number of blocks K varied 
from 10 to 1000 where N could be divided evenly by K. At each value of K, the root- 
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Figure 3: Boxplots of biases for 3 types of estimators (CEE, CUEE, EE) of (3j (estimated 
/3j - true /3j), j = 1,..., 5, for varying nj,. 

mean square error (RMSE) of both the CEE and CUEE estimators were calculated as 
1 ) ^ represents the coefficient in either the CEE or CUEE terminal 

estimate. The averaged RMSEs are obtained with 200 replicates. Figure shows the plot 
of averaged RMSEs versus the number of blocks K. It is obvious that as the number of 
blocks increases (block size decreases), RMSE from CEE method increases very fast while 
RMSE from the CUEE method remains relatively stable. 

4.2.2 Robust Poisson Regression 

In these simulations, we compared the performance of the (terminal) CEE and CUEE 
estimators with the EE estimator based on all of the data. We generated B = 500 
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Figure 4: Boxplots of standard errors for 3 types of estimators (CEE, CUEE, EE) of /3j, 
j = 1,... ,5, for varying Standard errors have been multiplied by y/Kn^ = VN for 
comparability. 

datasets of i/i ~ Poisson(pj), independently for i = 1,..., iV with log(/ij) = x'/3 where /3 = 
(0.3, —0.3,0.3, —0.3, 0.3)', Xkiii] = 1, Xi[2:3] ~ -^(0, 12 ) independently, Xjpj ~ Bernoulli{0.25) 
independently, and Xi^ ~ Bernoulli{Q.l) independently. We hxed K = 100, but varied 
nk = ne {50,100,500}. 

Figure [^shows boxplots of the biases in the 3 types of estimators (CEE, CUEE, EE) of 
/3j, j = 1,..., 5, for varying Uk- The CEE estimator tends to be the most biased, particularly 
in the intercept, but also in the coefficients corresponding to binary covariates. The CUEE 
estimator also suffers from slight bias, while the EE estimator performs quite well, as 
expected. Also as expected, as increases, bias decreases. The corresponding robust 
(sandwich-based) standard errors are shown in Figure]^ but the results were very similar 
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Table 2: Root Mean Square Error Ratios of CEE and CUEE with EE 






^2 

/Ss 



Uk 

= 50 

CEE 

4.133 

1.005 

1.004 

1.880 

2.288 



CUEE 

1.180 

1.130 

1.196 

1.308 

1.403 

nk 

= 100 

CEE 

2.414 

1.029 

1.036 

1.299 

1.810 



CUEE 

1.172 

1.092 

1.088 

1.118 

1.205 

nk 

= 500 

CEE 

1.225 

1.002 

1.002 

1.060 

1.146 



CUEE 

0.999 

1.010 

1.016 

0.993 

1.057 


for variances estimated by and In the plot, as Uk increases, the standard errors 
become quite similar for the three methods. 

Table [2] shows the coefficient-wise RMSE ratios : 

RMSE(CEE) RMSE(CUEE) 

RMSE(EE) RMSE(EE) ’ 

where we take the RMSE of the EE estimator as the gold standard. The RMSE ratios 
for CEE and CUEE estimators con&m the boxplot results in that the intercept and the 
coefficients corresponding to binary covariates (/34 and /^s) tend to be the most problematic 
for both estimators, but more so for the CEE estimator. 

For this particular simulation, it appears = 500 is sufficient to adequately reduce 
the bias. However, the appropriate subset size n^, if given the choice, is relative to the 
data at hand. For example, if we alter the data generation of the simulation to instead 
have Xi[ 5 ] ~ Bernoulli{Q.Ql) independently, but keep all other simulation parameters the 
same, the bias, particularly for /Ss, still exists at = 500 (see Figure but diminishes 
substantially with = 5000. 


4.2.3 Rank Deficiency and Generalized Inverse 

Consider the CUEE estimator for a given dataset under two choices of generalized inverse, 
the Moore-Penrose generalized inverse, and a generalized inverse generated according to 


Theorem 2.1 of Rao and Mitra (1972). For this small-scale, proof-of-concept simulation, 
we generated B = 100 datasets of yi ~ Bernoulh(/ij), independently for i = 1,..., 20000, 
with logit(pi) = x'/3 where /3 = (1,1,1,1,1)', Xip] ~ Bernoulli(0.5) independently, Xi[ 3 : 5 ] ~ 
A(0,l3) independently, and Xfcj[i] = 1. We hxed A = 10 and Uk = 2000. The pairs of 
(l/j,Xj) observations were considered in different orders, so that in the hrst ordering all 
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nk=1000, betas 


nk=5000, betas 



CEE CUEE EE 



CEE CUEE EE 



CEE CUEE EE 


Figure 5: Boxplots of biases for 3 types of estimators (CEE, CUEE, EE) of (3^ (estimated 
/S 5 - true /Ss), for varying when Xi\^^ ~ Bernoulli{Q.Ql). 


subsets would result in Anf,,k being of full rank, k = 1 ,... ,iF, but in the second ordering 
all of the subsets would not have full rank ^ due to the grouping of the zeros and 
ones from the binary covariate. In the hrst ordering, we used the initially proposed CUEE 


estimator (3^^ in (28) to estimate (3 and its corresponding variance in (31). In the 
second ordering, we used two different generalized inverses to compute denoted by 


CUEE^”^ and CUEE^”'* in Table 5, with variance given by The estimates reported 

in Table were averaged over 100 replicates. The corresponding EE estimates, which are 
computed by htting all N observations simultaneously, are also provided for comparison. 
As expected, the values reported for CUEE^”^ and CUEEg”^ are identical, indicating that 
the estimator is invariant to the choice of generalized inverse, and these results are quite 
similar to those of the EE estimator and CUEE estimator with all full-rank matrices 
k = l,...,K. 


(-) 


5 Data Analysis 

We examined the airline on-time statistics, available at http://stat-computing.org/ 
dataexpo/2009/the-data.html. The data consists of flight arrival and departure details 
for all commercial flights within the USA, from October 1987 to April 2008. This involves 
N = 123,534,969 observations and 29 variables (~ 11 GB). 

We hrst used logistic regression to model the probability of late arrival (binary; 1 if late 
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Table 3: Estimates and standard errors for CUEE^ \ CUEE 2 \ CUEE, and EE estimators. 
CUEE^”^ and CUEE^”^ correspond to CUEE estimators nsing two different generalized 
inverses for when An^^k is not invertible. 


cuee)”^ 

^Kj 

CUEE^”^ 

^Kj 

CUEE 

Pkj S^0Kj) 

EE 

Pnj se0Nj) 

0.9935731 

0.02850429 

0.9935731 

0.02850429 

0.9940272 

0.02847887 

0.9951570 

0.02845648 

0.8902375 

0.03970919 

0.8902375 

0.03970919 

0.8923991 

0.03936931 

0.8933344 

0.03935490 

0.9872035 

0.02256396 

0.9872035 

0.02256396 

0.9879017 

0.02247598 

0.9891857 

0.02245082 

0.9916863 

0.02264102 

0.9916863 

0.02264102 

0.9925716 

0.02248187 

0.9938864 

0.02246949 

0.9874042 

0.02260353 

0.9874042 

0.02260353 

0.9882167 

0.02247671 

0.9895110 

0.02244759 


by more than 15 minntes, 0 otherwise) as a fnnction of departnre time (continnous); distance 
(continnons, in thonsands of miles), day/night flight status (binary; 1 if departure between 
8 pm and 5am, 0 otherwise); weekend/weekday status (binary; 1 if departure occurred 
during the weekend, 0 otherwise), and distance type (categorical; ‘typical distance’ for 
distances less than 4200 miles, the reference level ‘large distance’ for distances between 
4200 and 4300 miles, and ‘extreme distance’ for distances greater than 4300 miles) for 
N = 120, 748, 239 observations with complete data. 

For CEE and CUEE, we used a subset size of = 50,000 for k = 1,..., K — 1, 
and riK = 48239 to estimate the data in the online-updating framework. However, to 
avoid potential data separation problems due to rare events (extreme distance; 0.021% of 
the data with 26,021 observations), a detection mechanism has been introduced at each 
block. If such a problem exists, the next block of data will be combined until the problem 
disappears. We also computed EE estimates and standard errors using the commercial 
software Revolution R. 

All three methods agree that all covariates except extreme distance are highly associated 
with late flight arrival {p < 0.00001), with later departure times and longer distances 
corresponding to a higher likelihood for late arrival, and night-time and weekend flights 
corresponding to a lower likelihood for late flight arrival (see Table |^. However, extreme 
distance is not associated with the late flight arrival {p = 0.613). The large p value 
also indicates that even if number of observations is huge, there is no guarantee that all 
covariates must be signihcant. As we do not know the truth in this real data example, we 


27 










Table 4: Estimates and standard errors (xlO®) from the Airline On-Time data for EE 


(compnted by Revolntion R), CEE, and CUEE estimators. 



EE 

Pni 

se{^Ni) 

CEE 

se{^Ki) 

CUEE 

^Kj 

se{^Ki) 

Intercept 

-3.8680367 

1395.65 

-3.70599823 

1434.60 

-3.880106804 

1403.49 

Depart 

0.1040230 

6.01 

0.10238977 

6.02 

0.101738105 

5.70 

Distance 

0.2408689 

40.89 

0.23739029 

41.44 

0.252600016 

38.98 

Night 

-0.4483780 

81.74 

-0.43175229 

82.15 

-0.433523534 

80.72 

Weekend 

-0.1769347 

54.13 

-0.16943755 

54.62 

-0.177895116 

53.95 

TypDist 

0.8784740 

1389.11 

0.76748539 

1428.26 

0.923077960 

1397.46 

ExDist 

-0.0103365 

2045.71 

-0.04045108 

2114.17 

-0.009317274 

2073.99 


compare the estimates and standard errors of CEE and CUEE with those from Revolntion 
R, which compntes the EE estimates, bnt notably not in an online-npdating framework. 
In Table the CUEE and Revolntion R regression coefficients tend to be the most similar. 
The regression coefficient estimates and standard errors for CEE are also close to those from 
Revolntion R, with the most discrepancy in the regression coefficients again appearing in 
the intercept and coefficients corresponding to binary covariates. 

We hnally considered arrival delay {ArrDelay) as a continnons variable by modeling 
\og{ArrDelay — min{ArrDelay) -|- 1) as a fnnction of departnre time, distance, day/night 
flight statns, and weekend/weekday flight statns for United Airline flights (A^ = 13, 299, 817), 


and applied the global predictive residnal ontlier tests discnssed in Section |2.3.2[ Using 
only complete observations and setting = 1000, m = 3, and a = 0.05, we fonnd that the 


normality-based F test in (18) and asymptotic F test in (19) overwhelmingly agreed npon 
whether or not there was at least one ontlier in a given snbset of data (96% agreement 
across K = 12803 snbsets). As in the simnlations, the normality-based F test rejects more 
often than the asymptotic F test: in the 4% of snbsets in which the two tests did not agree, 
the normality-based F test alone identihed 488 additional snbsets with at least one ontlier, 
while the asymptotic F test alone identihed 23 additional snbsets with at least one ontlier. 













6 Discussion 


We developed online-updating algorithms and inferences applicable for linear models and 
estimating equations. We used the divide and conquer approach to motivate our online- 
updated estimators for the regression coefficients, and similarly introduced online-updated 
estimators for the variances of the regression coefficients. The variance estimation allows 
for online-up dated inferences. We note that if one wishes to perform sequential testing, 
this would require an adjustment of the a level to account for multiple testing. 

In the linear model setting, we provided a method for outlier detection using predictive 
residuals. Our simulations suggested that the predictive residual tests are more powerful 
than a test that uses only the current dataset in the stream. In the EE setting, we may 
similarly consider outlier tests also based on standardized predictive residuals. For example 
in generalized linear models, one may consider the sum of squared predictive Pearson or 
Deviance residuals, computed using the coefficient estimate from the cumulative data (i.e., 
or /3fc_i). It remains an open question in both settings, however, regarding how to 
handle such outliers when they are detected. This is an area of future research. 

In the estimating equation setting, we also proposed a new online-up dated estimator 
of the regression coefficients that borrows information from previous datasets in the data 
stream. The simulations indicated that in finite samples, the proposed CUEE estimator is 


less biased than the AEE/CEE estimator of Lin and Xi (2011). However, both estimators 
were shown to be asymptotically consistent. 

The methods in this paper were designed for small to moderate covariate dimensionality 
p, but large N. The use of penalization in the large p setting is an interesting consideration. 


and has been explored in the divide-and-conquer context in Chen and Xie (2014) with 
popular sparsity inducing penalty functions. In our online-updating framework, inference 
for the penalized parameters would be challenging, however, as the computation of variance 
estimates for these parameter estimates is quite complicated and is also an area of future 
work. 

The proposed online-updating methods are particularly useful for data that is obtained 
sequentially and without access to historical data. Notably, under the normal linear re¬ 
gression model, the proposed scheme does not lead to any information loss for inferences 
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involving /3, as when the design matrix is of full rank, Q and (|^ are sufficient and complete 
statistics for (3 and cr^. However, under the estimating equation setting, some information 
will be lost. Precisely how much information needs to be retained at each subset for specific 
types of inferences is an open question, and an area devoted for future research. 


References 

Benjamini, Y. and Hochberg, Y. (1995), “Controlling the False Discovery Rate: A Practical 
and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society. 
Senes B, 57, 289-300. 

Benjamini, Y. and Yekutieli, D. (2001), “The control of the false discovery rate in multiple 
testing under dependency,” Annals of Statistics, 29, 1165-1188. 

Chen, K., Hu, L, and Ying, Z. (1999), “Strong consistency of maximum quasi-likelihood 
estimators in generalized linear models with fixed and adaptive designs,” The Annals of 
Statistics, 27, 1155. 

Chen, X. and Xie, M.-g. (2014), “A Split-and-Conquer Approach For Analysis of Extraor¬ 
dinarily Large Data,” Statistica Sinica, preprint. 

DeGroot, M. and Schevish, M. (2012), Probability and Statistics, Boston, MA: Pearson 
Education. 

Green, P. (1984), “Iteratively Reweighted Least Squares for Maximum Likelihood Esti¬ 
mation, and some Robust and Resistant Alternatives,” Journal of the Royal Statistical 
Society Series B, 46, 149-192. 

Guha, S., Hafen, R., Rounds, J., Xia, J., Li, J., Xi, B., and Gleveland, W. S. (2012), “Large 
complex data: divide and recombine (D&R) with RHIPE,” Stat, 1, 53-67. 

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. 1. (2014), “A Scalable Bootstrap for 
Massive Data,” Journal of the Royal Statistical Society: Series B (Statistical Methodol¬ 
ogy), 76, 795-816. 


30 


Liang, F., Cheng, Y., Song, Q., Park, J., and Yang, P. (2013), “A Resampling-based 
Stochastic Approximation Method for Analysis of Large Geostatistical Data,” Journal 
of the American Statistical Association, 108, 325-339. 

Lin, N. and Xi, R. (2011), “Aggregated Estimating Equation Estimation,” Statistics and 
Its Interface, 4, 73-83. 

Ma, P., Mahoney, M. W., and Yu, B. (2013), “A Statistical Perspective on Algorithmic 
Leveraging,” arXiv preprint arXiv:1306.5362. 

Muirhead, R. J. (2009), Aspects of multivariate statistical theory, vol. 197, John Wiley & 
Sons. 

Rao, C. and Mitra, S. K. (1972), “Generalized inverse of a matrix and its applications.” in 
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 
Berkeley, Calif.: University of California Press, vol. 1, pp. 601-620. 

Searle, S. (1971), Linear Models, New York-London-Sydney-Toronto; John Wiley and Sons, 
Inc. 

Stengel, R. F. (2012), Optimal control and estimation. Courier Corporation. 

Wang, C., Chen, M.-H., Schifano, E., Wu, J., and Yan, J. (2015), “A Survey of Statistical 
Methods and Computing for Big Data,” arXiv preprint arXiv:1502.07989. 


31 



Supplementary Material: Online Updating of 
Statistical Inference in the Big Data Setting 

A: Bayesian Insight into Online Updating 

A Bayesian perspective provides some insight into how we may construct our online¬ 
updating estimators. Under a Bayesian framework, using the previous k — 1 subsets of data 
to construct a prior distribution for the current data in subset k, we immediate identify the 
appropriate online updating formulae for estimating the regression coefficients and the error 
variance. Conveniently, these formulae require storage of only a few low-dimensional quan¬ 
tities computed only within the current subset; storage of these quantities is not required 
across all subsets. 

We hrst assume a joint conjugate prior for (/3, as follows: 

7r{f3,a‘^\HQ,yo,^o,ro) = 7r{f3\a'^, fiQ,\o)7r{a^\uo,To), (A.l) 

where is a prespecihed p-dimensional vector, Vq is a p x p positive dehnite precision 
matrix, z/q > 0, Tq > 0 , and 

IV PA f \ ■) 

7r(/3|a^MoAo) = ^^^^^^expj - — (/3-Aio)'Vo(/3-Mo)|> 

When the data Di = {(yi,Xi)} is available, the likelihood is given by 
L{f3,a^\Di) oc -^^^^y^expj - ^(yi - Xi/3)'(yi - Xi/3)|. 

After some algebra, we can show that the posterior distribution of (/3, is then given by 
7 r(/3, a‘^\Di, Vq, z/q, tq) = 7r{f3\a^, Vi)7r(o-^|z/i, n). 


SI 



where — (X'^Xi + Vq) ^(X'^Xi^^^ + Vo^tg), Vi — X'^Xi + Vq, vi — ni + vq, and 
= T-o + (yi - Xi^„^ J'(yi - Xi^„^ i) + ^oVoA^o + ^ see, for 


example, Section 8.6 of DeGroot and Schevish (2012). Using mathematical indnction, we 


can show that given the data = {(y^, X^), £ = 1, 2,..., A;}, the posterior distribntion of 


(/3, (T^) is 7 r(/ 3 , Vfc, z/fc, Tfc), which has the same form as in (A.l) with (^g, Vq, z/q, tq) 

npdated by (aa^,, V*,, z/^, ta,), where 

= (x',x, + 

Vfc= X',Xfc + Vfc_i, 

i'k= Uk + Uk-i, (^- 2 ) 

Tfc = Tk-l + {jk - Xfc^„^ fc)'(yfc - ^k^nk,k) 

+H'k_lVk-lfJ^k-l + hnk,l^'k^khnk,k ~ f^'k^kfJ^k^ 
for k = 1,2,.... The data stream structnre hts the Bayesian paradigm perfectly and the 
Bayesian online npdating sheds light on the online npdating of LS estimators. Let 0^ 
MSEfc denote the LS estimate of (3 and the corresponding MSE based on the cnmnlative 
data Dk = {(y^, X^), £ = 1 , 2 ,..., k}. As a special case of Bayesian online npdate, we can 
derive the online npdates of $k ^^^1 MSEa,. Specihcally, we take i and nse the 


npdating formnla for H). in (A. 2 ). That is, taking ^tg = 0 and Vg = Op in (A. 2 ), we obtain 


k = (x;xi + 


(A.3) 


where = 0, Pn^k is dehned by ([^ or ([^ and Va, = for /c = 1 , 2 ,.... 

Similarly, taking z/q = no = 0 , tq = SSEq = 0 , and nsing the npdating formnla for ta, in 


(A.2), we have 


SSE, = SSE,_i + SSE„„, + ^;_iV,_i^,_i+^l^,,XlX,^„^,,-^;Vfc^, (A.4) 

where SSE^^. a; is the residnal snm of sqnares from the k^^ dataset, with corresponding 
residnal mean sqnare MSE„j,^a: =SSE„j,_A:/(nA: —p)- The MSE based on the data Dk is then 


S2 









MSEfc = SSEfc/(A^fc — p) where = Yli=i (= + -^fe-i) for k = 1,2,.... Note that for 

k = K, equations (A.3|) and (A.4) are identical to those in (|^ and Q, respectively. 


B: Online Updating Statistics in Linear Models 


Below we provide online-up dated t-tests for the regression parameter estimates, the online- 
updated ANOVA table, and online-updated general linear hypothesis F-tests. Please refer 
to Section 12.2.11 of the main text for the relevant notation. 


Online Updating for Parameter Estimate t-tests in Linear Models. If our interest 
is only in performing t-tests for the regression coefficients, we only need to save the current 
values (Vfc, MSEfc) to proceed. Recall that var(^) = cr^(X'X)“^ and = 

MSE(X'X)“^. At the update, = MSE^V^^. Thus, to test Hq : (3j = 0 at the 

k^^ update (j = 1 ,... ,p), we may use j = (3k,j/se{(3k,j), where the standard error se{/3k,j) 
is the square root of the diagonal element of The corresponding p-value is 

P{\tN,-p\ > 

Online Updating for ANOVA Table in Linear Models. Observe that SSE is given 

by (|4|), 

K K 

SST = y'y - Nf = ^y^fc “ 

k=l k=l 


where 1^^, is an length vector of ones, and SSR = SST-SSE. If we wish to construct 
an online-updated ANOVA table, we must save two additional easily computable, low 
dimensional quantities: Syy^k = ZlLi Sy^k = Y!l=iy'i^ni = Y!1 =i YliLiVti- 

The online-updated ANOVA table at the update for the cumulative data Dk is 
constructed as in Table Note that SSE^ is computed as in (A.4). The table may 
be completed upon determination of an updating formula SST^.. Towards this end, write 

Pyy,k yk^k T Pyy,k—1 Sud Sy k yk^'n’k P Py,k—li for k 1 , . . . , K and Syy Q Py,0 0) 
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Table 5: Online-up dated ANOVA table 


ANOVA^, 

Table 




Source 

df 

SS 

MS 

F P-value 

Regression 

p — 1 

SSRfc 

MSRfc = 

p_l 

I?*_ iVlSRfe D/- IP ^ jp*\ 

^ - MSK ^ ) 

Error 

C Total 

A A 

1 1 

SSEfc 

SSTfc 

MSEfc = 

Nk-p 



that SSTfc = Syy,k - N-^Sl^ 

Online updated testing of General Linear Hypotheses {Hq ; C/3 = 0) are also possible: 
if C (g X p) is a full rank (g < p) contrast matrix, under Hq, 


F, = 


SSE, 


'-)/{ 


N, 


p 


Fq,N^.-p- 


Similarly, we may also obtain online updated coefficients of multiple determination, = 
SSRfc/SSTfc. 

To summarize, we need only save (Vk-i, ^k-i,FlSEk-i, Syy^k-i, Sy^k-i) from the 
previous accumulation point fc — 1 to perform online-updated f-tests for Hq : (3j = 0, 
j = 1,... ,p and online-up dated F-tests for the current accumulation point k] we do not 
need to retain (V^, Ni, MSE^, Syy^i, Sy^g) for £ = 1,..., — 2. 


C: 


Proof of Proposition 


2.4 


We hrst show that MSEfc_i A cx^. Since SSEfc_i = ^ 

we have 


plim MSEfc_i = plim ^ 

Nk-i - P 


= plim 

A^fc_ 1^00 


£fc_igfc-i 

Nk-i 


plim 

A^fc-l^OO 


e’,_,Xk-r{X'k_,Xk-i)-^K-iek-i 


2 s'^-iXk-i X'f._^Xk- 

= a - phm — - - phm (- - — 

Nk — l^OO ^'k—1 A^fc_i^OO ^'k — 1 


Nk-i 

-1 


i) 


plim 

Nk-i^oo 


Nk-i 
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Let Xj denote the column vector of for j = 1 ,... ,p. Since E{ei) = 0, Vi and all 

the elements of X^-i are bounded by C, by Chebyshev’s Inequality we have for any ^ and 
column vector Xj, 


Pi\ 


Nu-i 


>()< 


Varie'^_,X,) 


< 




and thus 


plim 


£'k_lXk_i 


Nk-i 


0 and 


plim MSEfc_i = — 0 ■ Q ^ ■ 0 = 


Next we show 


E- 




d 


Am 


First, recall that 


— Jk — Jk 

= Xfc/3 + efc — 'X.kPk-i 

= Xfc/3 + efc - Xk{Xi_,Xk-i)-^Xi_,y^_^ 

= £k — ^k{X'f^_^X k-l) ^X'f,_^£k-1- 

Consequently, var(efc) = {1^. + Xk{X'i^_^Xk-i)~^X'f,)a‘^ = F'Ea^, where F is an Uk x Uk 
invertible matrix. Let = (F')“^efc with var(e^) = Therefore, each component of 

el is independent and identically distributed. 

By the Central Limit Theorem and condition (4), we have for alH = 1,..., m, 

_L (I' p* ')2 
rik- ^ ki kiJ d 

-r-^ Yi, as Uk oo. 

Since each subgroup is also independent. 


E 


™ J-IF p* i 2 

i—1 rife ^ ki ki> ^ 2 

-^-^ Xmi 


as Uk —)■ oo. 


By Slutsky’s theorem. 


y-m A* ',2 

Xii=l rife, y-'-kiki) ^ 


MSEfe.i 


Am’ 


as Uk, Nk-i —)■ oo. 
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D: Computation of T for Asymptotic F test 

Recall that var(efc) = = TF' , where T is an x invert¬ 

ible matrix. For large n^, it may be challenging to compnte the Cholesky decomposition 
var(efc). One possible solntion that avoids the large issne is given as follows. 

First, we can easily obtain the Cholesky decomposition of {X'f._^X = P'P 
since it is a p x p matrix. Thns, we have 

var(efc) = + XfcP'PX',)-V 2 = (I„, + XfcX',)-V^ 

where X^ = X^P' is an x p matrix. 

Next, we compnte the singnlar valne decomposition on X^, i.e., X^. = UDV' where U 
is an Uk x n*, nnitary matrix, D is an x Uk diagonal matrix, and V is a x p nnitary 
matrix. Therefore, 

var(efc) = + UDD'UO’V^ = U(I,, + DDO’^UV^ 

Since + DD')“^ is a diagonal matrix, we can hnd the matrix Q snch that (I^^, -|- 
DD')“^ = Q'Q by straightforward calcnlation. One possible choice of P is UQ'. 


E: Proof of Theorem 13.2 


We nse the same dehnition and two facts provided by Lin and Xi (2011), given below for 
completeness. 


Definition 6.1 Let A be a d x d positive definite matrix. The norm of A is defined as 

l|A|| = 

Using the dehnition of the above matrix norm, one may verify the following two facts. 
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Fact A.l. Suppose that A is a. d x d positive definite matrix. Let A be the smallest 
eigenvalue of A, then we have v'Av > Av'v = A||v||^ for any vector v G On the 
contrary, if there exists a constant O > 0 such that v'Av > C ||v||^ for any vector v G 
then C* < A. 

Fact A.2. Let A be a d x d positive definite matrix and A is the smallest eigenvalue of A. 
If A > c > 0 for some constant c, one has ||A~^|| < c~^. 


In order to prove Theorem 3.2, we need the following Lemma. 


Lemma 6.2 Under (C4’) and (C6), satisfies the following condition: for any rj > 0, 


Proof of Lemma 6.2 (By induction) 

First notice that (C6) is equivalent to writing, for any rj > 0, ^ — /3q|| > 


p) = 0(l). 

Take k = 1, ^ and thus > v) = 0(1). 

Assume the condition holds for accumulation point k — 1: — /3ol| > 

7]) = 0(1). Write 


k-2 

^n,k-l (Afc_2 + An^k—l) (^^^n,e^n,i T -^n,k—l^n,k-l) 

£=1 

SO that, rearranging terms, we have 

k-2 

^ ^n,e^n,e ~ (Afc-2 + ^n,k-l)j^n,k-l ~ -^ri,k-l^n,k-l- 
e=i 

Using the previous relation, we may write as 

^n,k = (Afc-l + An^k) ^{^k-2$n,k-l + ^n,k-i$n,k-l + 

■^n,kf^n,k T ■^n,k-fif^n,k-l ~ l^n,k-l)) 

= (Afc_i + An,k) {^k-l^n,k-l ■^n,k^n,k A ■^ri,k-l{fin,k-l ~ ^n,k-l))- 
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Therefore, 


^n,k /^O + -^n,k) /^o) ■^n,k{fin,k 

^n,k-l{f^n,k-l ~ /^O + /^O ~ f^n,k-l)) 

and 

ll^n,fc~/3oll < ||(Afc-l + A„^fc) ^^k-l\\\\^n,k-l ~ f^o\\ + 

||(Afc_i + An^k) ^A„^fc|| — (3q\\ + 

||(Afc_i + An^k) ^A„_fc_i||— /3oll + 

II (Afc_i + An^k) A„^fc_l|| ll^^^;;.-! ~ /^oll 

Note that ||(Afc-i + A„,fc)~^Afc_i|| < 1 and ||(Afc_i + A„,fc)"^A„,fc|| < 1- Under (C4’), 
||(Afc_i + A„^fe)"iA„^fc_i|| < ||(A„^fc)"iA„^fc_i|| < ^ <C, where U is a constant, Ai > 0 is 
the smallest eigenvalue of Ai, and A 2 is the largest eigenvalue of A 2 . Note that if 7 ^ n 
for all k, then || (Afc_i + A„,fc)"^A„,fc-i|| < || (A„,fc)"^A„,fc_i|| < < C, where Uk-i/uk 

is bounded and U is a constant. Thus, 

ll^n,fc ~ /^oll < ll^n,fc-l ~ /^oll + ll^n,fc ~ /3oll + 

\\C0n,k-i-M\ + \\C0^^,_,-f3,)\\ 

Under (C 6 ) and the induction hypothesis, then for any f] > 0, 

n-^“+'P(|P„,^ - /3„|| > ^) < - All > £;)+ 

n- 2 “+‘F(||A,t-AII>£;)+ 
n-2“+‘ P(!IAa-i - All > ^)+ 
n- 2 “+‘F(||A,t_,-AII> 

Since all the four terms on the right hand side are 0(1) by assumption, n“^“’'“^P(||;3„^ — 

AII>T) = o(i)' ■ 

n" 
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3.2 


We are now ready to prove Theorem 
Proof of Theorem 13.21 
First, suppose that all the random variables are dehned on a probability space (f2,W, P). 
Let 


^n,k,V = - /3oll < v}, 

nN,rj = - /3o|| < v}, 


N,k,r] ^k=l^n-,k,ri Fl 


From Lemma 6.2, for any a; > 0, we have 

p(rs,,^,,) < P(n%j + ^ 

Pi^n,k,r, 


K 


k=l 


< n"“-^(0(l) + iF-O(l)) 

Since K=0{n'^), 7 < 1 — 2 a and 7 < a < ^ by assumption, we have lim P(F^^ ) —0. 

Next, we wish to show F M,k,ri ^ ^ <5}. Consider the Taylor expansion 

of —Mn,k{,^N) intermediary estimator : 


— Mn,k0N) — ~^n,k0n,k) + 
where rn,k is the remainder term with element \{^N~^n,ky X]r=i ^ 


-d^tpj{zki,(3l) r ^ a ^ 
8083' ^PN-f^n,k) 


for some 01 between 0^ and 0^^i^ 

Summing over k, 

K K K K 

0 = ~ ^ Mn,k{,0N) = ~ ^ ^n,k{0n,k) + ^ ^n,k{0n,k)i.flN ~ ^n,k) + ^ 

k=l k=l k=l k=\ 

Rearranging terms and recalling that An,k{0n,k) = ^n,k, we hnd 

” ” K K K 

An.fc) ^ ^ ^n,k- 


K 


K K K 

“/^Af + (^ + '^^Mn,k{0, 

k=l k=l k=l 

Using the dehnition of the CUBE estimator /3^, the 

K K 

' P N ~ ^ An^k) ^ ^ ^n,k 

k=l k=l 


8) = (J2‘ 


k=l k=l k=l k=l 

above relation reduces to 


3. 
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and 



/ K \~^ 



Wf^K ~ ^n\\ ^ 



1 ^ 


\ k=l / 


k=l 


For the first term, according to (C4’), 


— V A 

riK 2^k=l ^ 


n,k 


-1 


< ^ since A„^fc(/3) is a 


continnons fnnction of (3 (according to (Cl)) and is in the neighborhood of /^g for 
small enongh r]. For the second term, we introdnce set B^{j3Q) = {/3|||/3 — /3o|| < rj}. For 
all oj G we have f3\ G i?^(/3o) since is a convex set and G Bn^fBo). 

According to (C5), for small enongh rj, satisfies (C5) and thns f3l satishes (C5). 

Hence we have ||f„fc|| < C2pn\\f3j^ — ^nkW^ ^ ^ ^N,K,rj when rj is small enongh. 

Additionally, 


||fn,fe|| < C2Vnf^N - < C2pn(||/3^ - /3o||2 + 11/3^^^ - /3g||2) 

< 2C2pn^~^^T]‘^. 


Conseqnently, 

where C = ^. 

Therefore, for any <5 > 0, there exists rjs > 0 snch that Cr]j < 6. Then for any a; G F N,k,rjs 
and K = 0(n'’'), where 7 < min{l — 2a,4a — 1}, we have -\/iV||3x — ^n\\ — 0{n^^ )A 
Therefore, when n is large enongh, TN,k,ri ^ { 1 ^ G ~ PkW ^ ^md thns 

P{VN0j^ - ^^11 > 5) < ^ 0 as n ^ 00 . ■ 


F: 


Proof of Proposition 


3.5 


Snppose Xfc does not have fnll colnmn rank for some accnmulation point k. For ease of 
exposition, write = DiagI Sl^Wki )• Note that for generalized linear models with y^i 


SIO 



















from an exponential family, W^i = l/v(/ifcj) where v{fiki) is fhe variance fnnction. The 


IRLS approach is then implemented as follows. For t = 1, 2,.. 


W<'' = 

/3<'' = (XiW<‘-‘'Xi)-XiW‘‘-‘'zi‘-‘> 
v't' = Xi/3<‘', 

As Xfc is not of fnll rank, nses a generalized inverse and is not nniqne. Since 
is a diagonal positive dehnite matrix, there exists an invertible matrix V snch that 


WW = V'V, where V = JWl.'b We thus have 


r(t) 


rjf = V-'VXfc{(VXfc)'(VXfc)}-(VXfc)'V'-^Wr^'Z 


/-i r/-lTTr(i-l)r 7 p-l) 


Therefore, for t = 1, is unique no matter what generalized inverse of X'^W^X^. we 
use, given the same initial value Furthermore, since and depend on 

only through and thus are also invariant of the choice of generalized 

inverse. Similarly, we can show that for each iteration, z[,*^ and are unique no 

matter what generalized inverse of X'^W^* ^^X^ we use, given the same initial values. 

Now, the only problem left is whether the IRLS algorithm converges. We next show 
that converges under a special generalized inverse of Let X|, denote a 

Uk X p* full rank column submatrix of X^. Without loss of generality, we assume the p* 
columns of X^ are the hrst p* columns of X^. Assume X|, satishes (C1-C3) as given in 
and the IRLS estimates converge to ^9^ = (X^'WfcX^)“^X^'WfcZfc, where /3^ 


Section 


3.3 


is the p* X 1 vector of regression coefficients corresponding to X|,. Since is a full column 
rank submatrix of X^, there exists a p* x p matrix P such that X^ = XZ,P, where the hrst 


Sll 




p* X p* submatrix is an identity matrix. We thus have, 


= 






0 


0 






0 




Thus, for that special generalized inverse, converges to {p^ 0)^ By the uniqueness 

property given above, converges no matter what generalized inverse we choose. 

Upon convergence, Z3^*^ = = (X.'^.WkXk)~X'^.WkZk and = X'^WfcXfc. As in 

the normal linear model case, An^^k(3n^.,k is invariant to the choice of A~^ as it is always 
X'^WfcZfc. Therefore, the combined estimator Z^wif is invariant to the choice of generalized 
inverse A~^ ^ of An^^^. Similar arguments can be used for the online estimator ■ 
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